Using bilingual corpora for the construction of contrastive generation grammars: issues and problems
Abstract
This paper reports on the use of corpora for the construction of a computational grammar of Spanish, contrastive with English, in the application context of Multilingual Natural Language Generation (MLG). The theoretical framework for this work is Systemic Functional Linguistics (SFL), and the computational context is provided by KPML (Komet Penman Multilingual), an extensive grammar development environment and generation engine that supports large-scale multilingual development (Bateman 1997). The initial phenomena being investigated contrastively belong to three different functional regions of the grammar, i.e., particular subareas of the grammar that are concerned with particular areas of meaning. These regions are transitivity (ideational meaning), thematicity (textual meaning) and mood (interpersonal meaning). The present study concentrates on textual meaning (thematicity and focus) as an illustration. Following what has now established itself as a standard methodology for empirically based Natural Language Generation (Bateman 1998a, Reiter and Dale 1997), three steps were carried out. First, a bilingual corpus (English-Spanish) was selected. For Spanish, a sample of spoken texts from the MacroCorpus of the educated linguistic standard of the main cities of the Spanish-speaking world was used, while for English a comparable sample was selected from the British National Corpus Sampler. This choice was motivated by the need to provide a realistic account of the behaviour of the linguistic phenomena investigated in unplanned and spontaneous contexts of use. The second step was to carry out a contrastive analysis of the phenomena mentioned above. Finally, the results of the analysis were coded up as resources/processes for generation. In the case of Spanish, these had to be created anew.
In the case of English, since KPML already includes an English generation grammar, this last step consisted of checking the coverage of the existing specifications, extending them where they could not cover the instances found in the corpus, and adapting them for effective MLG. Given the nature of the NLG process, which typically converts communicative goals expressed in some internal representation into surface forms, the kind of information most readily usable for NLG consists of statements of mappings from functions to forms. Therefore, the corpus analysis phase for NLG usually includes an explicit, and usually quite lengthy, linguistic analysis in which the analyst seeks possible realisations of communicative functions; this restricts the size of the corpus that can realistically be considered. This paper describes the different steps carried out for the generation of the linguistic phenomena mentioned above, …
Related resources
Morphosyntactic Analysis of the CHILDES and TalkBank Corpora
This paper describes the construction and usage of the MOR and GRASP programs for part of speech tagging and syntactic dependency analysis of the corpora in the CHILDES and TalkBank databases. We have written MOR grammars for 11 languages and GRASP analyses for three. For English data, the MOR tagger reaches 98% accuracy on adult corpora and 97% accuracy on child language corpora. The paper dis...
Using sign language corpora as bilingual corpora for data mining: Contrastive linguistics and computer-assisted..
More and more sign languages are now documented by large-scale digital corpora. But exploiting sign language (SL) corpus data remains subject to the time-consuming and expensive manual task of annotating. In this paper, we present ongoing research that aims at testing a new approach to better mine SL data. It relies on the methodology of corpus-based contrastive linguistics, exploit...
Preparation and exploitation of bilingual texts
A bitext is a merged document composed of two versions of a given text, usually in two different languages. An aligned bitext is produced by an alignment tool, or aligner, that automatically aligns or matches the versions of the same text, generally sentence by sentence. A multilingual aligned corpus, or collection of aligned bitexts, when consulted with a search tool, can be extremely useful for...
Translation and contrastive linguistic studies at the interface of English and Chinese: Significance and implications
Corpora have revolutionized nearly all areas of linguistic research over the past four decades (McEnery, Xiao and Tono 2006; McEnery and Hardie 2012). Translation studies and contrastive linguistics are no exceptions. Indeed, the rapid development of bilingual parallel corpora as well as monolingual and multilingual comparable corpora since the early 1990s has been of particular relevance and c...
Stochastic Inversion Transduction Grammars, with Application to Segmentation, Bracketing, and Alignment of Parallel Corpora
We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and (2) the concept of bilingual parsing with potential application to a variety of parallel corpus analysis problems. The formalism combines three tactics against the constraints that render finite-state transducers less useful: it skips directly to a context-free rat...